Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents
Authors
Abstract
In this paper we discuss a new model for document clustering adapted from the non-negative matrix factorization (NMF) method. The key idea is to cluster the documents after measuring their proximity to the extracted features. The extracted features are treated as the final cluster labels, and clustering is done using cosine similarity, which is equivalent to a single iteration of k-means. The model was implemented using the Apache Lucene project for indexing documents and the MapReduce framework of the Apache Hadoop project for a parallel implementation of the k-means algorithm. Since the experiments were carried out on only one Hadoop cluster, a significant reduction in time was obtained with the MapReduce implementation when the cluster size exceeded 9, i.e. 40 documents averaging 1.5 kilobytes. We therefore conclude that the features extracted using NMF can be used to cluster documents by treating them as final cluster labels, as in k-means, and that for large-scale document collections the parallel implementation using MapReduce can reduce computational time. We have termed this model KNMF (K-means with NMF algorithm).
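The clustering step described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' Lucene/Hadoop implementation: it assumes a terms-by-documents matrix `A`, uses the standard multiplicative-update rule for NMF (the abstract does not say which NMF solver was used), and the function names `nmf` and `assign_by_cosine` as well as the toy data are hypothetical.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (terms x docs) as W @ H.

    Uses multiplicative updates (an assumption; the abstract does
    not specify the solver). Columns of W are the extracted features.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = A.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def assign_by_cosine(A, W, eps=1e-9):
    """Assign each document (column of A) to its most similar feature.

    The features play the role of fixed centroids, so this is one
    assignment pass with no centroid update.
    """
    docs = A / (np.linalg.norm(A, axis=0, keepdims=True) + eps)
    feats = W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)
    sims = feats.T @ docs          # shape: (k, n_docs)
    return sims.argmax(axis=0)     # one cluster label per document

# Toy data (made up): 6 terms x 4 documents, two disjoint vocabularies.
A = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
    [0, 0, 1, 1],
], dtype=float)

W, H = nmf(A, k=2)
labels = assign_by_cosine(A, W)
print(labels)
```

In this sketch the first two documents should share one label and the last two the other. In a MapReduce setting, the assignment pass is the part that parallelizes naturally, because each document's similarity to the features can be computed independently.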
Related resources
Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing
Clustering of class labels can be generated automatically, which is much lower quality than labels specified by human. If the class labels for clustering are provided, the clustering is more effective. In classic document clustering based on vector model, documents appear terms frequency without considering the semantic information of each document. The property of vector model may be incorrect...
A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification or clustering method. Methods based on the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of ALS algorithms two convex least squares problems must be solved, which causes high com...
Big Data Summarization Using Semantic Feture for IoT on Cloud
Data management is a crucial aspect in the Internet of Things (IoT) on Cloud. Big data is about the processing and analysis of large data repositories on Cloud computing. Big document summarization method is an important technique for data management of IoT. Traditional document summarization methods are restricted to summarize suitable information from the exploding IoT big data on Cloud. This...
Ensemble Non-negative Matrix Factorization for Clustering Biomedical Documents
Searching and mining biomedical literature database, such as MEDLINE, is the main source of generating scientific hypothesis for biomedical researchers. Through grouping similar documents together, clustering techniques can facilitate user’s need of effectively finding interested documents. Since non-negative matrix factorization (NMF) can effectively capture the latent semantic space with non-...
Textmining and Organization in Large Corpus
Nowadays a document corpus commonly contains more than 5000 documents. It is almost impossible for a reader to read through all documents within the corpus and find relevant information in a couple of minutes. In this master thesis project we propose text clustering as a potential solution to organizing a large document corpus. As a sub-field of data mining, text mining is to discover...